RNA secondary structure formation: a solvable model of heteropolymer folding 
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The statistical mechanics of heteropolymer structure formation is studied in the context of RNA 
secondary structures. A designed RNA sequence biased energetically towards a particular native 
structure (a hairpin) is used to study the transition between the native and molten phase of the 
RNA as a function of temperature. The transition is driven by a competition between the energy 
gained from the polymer's overlap with the native structure and the entropic gain of forming random 
contacts. A simplified Go-like model is proposed and solved exactly. The predicted critical behavior 
is verified via exact numerical enumeration of a large ensemble of similarly designed sequences. 
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A biopolymer such as a DNA or protein is a heteropoly- 
mer. It consists of different types of monomers con- 
nected linearly in a specific order. Interactions among 
the monomers give each polymer a robust three dimen- 
sional structure on which its biological function depends. 
This sequence-to-structure relation is rather simple in the 
case of complementary DNA strands, but can be very 
complex in the case of proteins. The latter has been in- 
tensively studied in the last decade using many different 
approaches 

A number of important ingredients are involved in de- 
termining the structure of a heteropolymer. They include 
(i) thermal fluctuations which "denature" the polymer 
into a random coil at high temperatures, (ii) monomer- 
specific binding which freezes a random heteropolymer 
into a "glass" at low temperatures, and (iii) sequence 
correlation which biases the polymer into a certain spe- 
cific (nonrandom) structure, commonly referred to as the 
"native" structure. The native structures are selected in 
nature by evolution, but can also be obtained artificially 
through sequence design |^,^. The interplay of these in- 
gredients leads to a number of phases depending on the 
environment (e.g., the temperature) and the extent of se- 
quence correlations or design. The nature of these phases 
and the transitions among them have been discussed in 
the context of protein folding ^j^. However, protein- like 
models are not ideal systems to study phase-related is- 
sues because proteins are rather short (typically under 
500 monomers), and their thermodynamic limit is am- 
biguous: simply taking longer proteins leads to problems 
that are neither interesting nor relevant. 

Here, we shall take the view that the statistical me- 
chanics of a long heteropolymer governed by the compet- 
ing interactions mentioned above are of interest in their 
own right, regardless of their immediate application to 
protein folding. We will study the molecule RNA, an in- 
teresting biopolymer which has a mixture of protein-like 
and DNA-like properties 0. Due to the nature of the 
physical interaction between the monomers of a RNA, 
aspects of the RNA structure formation problem are con- 



siderably easier to treat than protein folding. Also, the 
thermodynamic limit is more meaningful for the RNA, 
which can contain over 50,000 monomers. In this paper, 
we will describe the simplest effect of sequence bias to the 
formation of RNA secondary structures (defined below). 
We will focus on the transition between a designed native 
structure and the RNA's "molten phase", a thermalized, 
collapsed phase which exists in an intermediate temper- 
ature regime in between denaturation and freezing. We 
will introduce a simplified two-state model to describe 
the effect of sequence bias, following N. Go's approach 
for proteins Q . We will solve the model exactly, and de- 
rive the critical properties of the transition between the 
native and the molten phase. The applicability of our 
two-state model to the native-molten transition of de- 
signed heterogeneous RNA sequences is verified by direct 
numerical enumeration and finite-size scaling analysis. 

RNA is a polynucleotide chain consisting of the four 
"bases" A, U, G, and C. Energy of the order of several 
/cbT's can be gained by forming complementary pairs 
(i.e., A — U and C — G) and then stacking them in a dou- 
ble helical structure similar to a double-stranded DNA. In 
order to form these base pairs, the RNA will need to bend 
back onto itself at various locations, resulting in a number 
of helical segments. These helices are then arranged in 
a three dimensional, so-called "tertiary" structure, stabi- 
lized by the much weaker interaction between the helices. 
Due to this crucial separation of energy scales for the 
RNA, it is possible to distinguish between the formation 
of "secondary" and tertiary structure: a RNA secondary 
structure is a collection of base pairings, with the restric- 
tion that any two base pairs (ii, ji) and (?2, ^2) have to be 
either "nested" (i.e., ii < 1-2 < j2 < ji) or "independent" 
(i.e., ii <ji <i2 <j2) Interactions violating these 

rules lead to structural elements which typically cannot 
form simple double-helices. Thus, they are energetically 
or kinetically suppressed and deemed part of the tertiary 
structure. Each such secondary structure can be rep- 
resented by a non-crossing arch diagram [see Fig. ^(a)], 
where a pairing between the bases i and j is indicated 
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by a dashed line connecting i and j on a stretched back- 
bone. Fig. ^b) shows an alternative representation of 
the same structure; here the backbone is bent and the 
dashed lines are short, in order to convey a sense of the 
backbone topology. The regions with consecutive base 
pairings form the above mentioned double-helices. 



a) 




FIG. 1. Representations of the secondary structure of a 
RNA: (a) a non-crossing arch diagram; (b) a helix diagram. 
The dashed lines indicate base pairings. 

To study the thermodynamic ensemble of all possible 
secondary structures of a given RNA molecule, the energy 
of each structure needs to be specified. Here we shall take 
the simplest energy function, with w(6, b') for each pairing 
of the bases (6, 6'), and vq — —sqT for each unpaired base 
mimicking the entropy gained from unbinding [T^ j . Accu- 
rate energy parameters including the effects of stacking, 
loops, etc. should be used for predicting actual sec- 
ondary structures of real RNA molecules. Due to their 
irrelevance for the asymptotic properties studied here 
we neglect them for simplicity. In fact, our parameters 
v{b, b') should be viewed as coarse-grained quantities de- 
scribing the energy of pairing two short segments of bases. 

The class of RNA secondary structures is clearly hi- 
erarchical and belongs to the class of Hartree diagrams 
widely used in the self-consistent treatment of many- 
body quantum systems. [In this sense, one may regard 
the RNA secondary structure problem as the Hartree the- 
ory of the "full" (protein-like) heteropolymer problem.] 
The recursive nature of the diagrams allows efficient com- 
putation of the exact partition function of an arbitrary 
sequence: Consider a segment of bases from the positions 
i to j > i inclusive. The base at j can be either unbound 
or bound to any base k S {i, . . . , j — 1}. For the sim- 
ple energy function we have adopted here, the partition 
function Zij for this segment of bases then obeys 
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where we take Sij =v{bi, bj)—2vo. The partition function 
Zi jv of a strand of length N can be computed recursively 
using (|l|) (with Z^,, = = 1) in 0{N^) time 

Before we discuss the effect of structure formation 
due to sequence bias, we first give a qualitative descrip- 
tion of the behavior of an uncorrelated random RNA 
sequence The energetics of this system is deter- 

mined by the mean eo{T) and standard deviation Se of 
the pairing energies e^j . Denaturation occurs at a tem- 
perature where eo(Id) « 0, since for large and positive 



the unbound state is preferred Below the de- 

naturation temperature Td, the bases of the RNA are 
mostly paired together. There, the system can take on 
two possible phases: At very low temperatures (T<cfe), 
heterogeneity of the sequence is important, forcing the 
polymer to adopt the optimal base-pairings which mini- 
mize the total energy; this is the glass phase At 
intermediate temperatures (Td > T ^ Se), differences in 
the binding energy are less important while an average 
attraction between the monomers still exists. There, the 
entropy of forming different pairings becomes dominant, 
resulting in the molten phase. In a separate study [ p5[ , 
we will demonstrate the perturbative irrelevance of weak 
sequence inhomogeneity in the intermediate tempera- 
ture regime, thereby establishing the stability and self- 
consistency of the molten phase. Furthermore, as se- 
quence heterogeneity is irrelevant in the molten phase, 
statistical properties in this phase can be obtained from 
(|l|) by simply taking Sij — Sq, where Eq < can be in- 
terpreted as an effective mean attraction. Here, we will 
take the existence of such a molten phase as a conjecture, 
and examine the effect of sequence bias. 

To do so, we need to construct a sequence with a dom- 
inant native structure. For simplicity, we will take the 
native structure to be a hairpin with a long stem, a struc- 
ture which has been studied experimentally for oligopep- 
tides and short RNA ||l8| , although we will consider 
here the limit where the stem is long. In this native 
structure, the bases (1, 2L), {2,2L - I), . . . , (L, L + 1) of 
a length N = 2L sequence are paired. We call these 
the "native pairs" or "native contacts". Bias towards 
this structure can be "designed" into the sequence by 
choosing a random sequence for the bases 1 to L of the 
molecule and then taking the second half of the molecule 
{L + 1 to 2L) to be the exact reverse complement of 
the first half. The perfectly complementary native pairs 
then make the native structure the "ground state" of the 
system. Upon increasing temperature, the entropy of 
forming non-native pairings will compete with and hence 
weaken the effective bias of the native structure. Alterna- 
tively, this bias can be weakened by random "mutations" 
of the designed sequence. For sufficiently weak effective 
bias, the RNA can "melt" from its native structure into 
any of the denatured, molten, or glass phase, depending 
on the temperature and the strength of the bias. 

To study the native-molten transition of the designed 
sequence analytically, we shall describe the pairing ener- 
gies and the bias by a simple two-state model. 



£ Si+j,2L + l + So, 



(2) 



for designed sequences of length 2L. The first term in 
(^) (with e < 0) mimics the additional attraction of na- 
tive pairs due to sequence design; \e] characterizes the 
"strength" of design which can be controlled by the "mu- 
tation" process mentioned above. The second term de- 
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scribes the average attraction of the "background" char- 
acteristic for the molten phase. The two-state model (||) 
is conceptually similar to the one introduced by N. Go in 
the context of protein folding js) . While Go proposed this 
model to simplify the numerical simulation of lattice pro- 
tein models, we will show that this model actually gives 
a quantitative description of the native- molten transition 
of the RNA secondary structure problem (|l]). 

Our task now is to study the system defined by Eqs. (|l|) 
and (^). We begin with a description of the molten phase, 
with £ in (||) set to zero. This phase is described by a 
single parameter, q = e"^"/"^. Due to the translational 
invariance of the uniform interaction, the partition func- 
tion can be written as Zij = Zo{j — i + 2; q) . In terms of 
the Laplace transform Zo(/i; q) = J27Li ^o(^; g)e^^^, the 
recursion relation (|l|) takes on the simple form Zq^ = 
e'^ — 1 — qZQ. Inverse Laplace transforming this solution 
in the limit of large i using the saddle point method, one 
finds the asymptotic form [p|Jl^ 



Zo{e;q) = A((7)r^e''''(9)^ 



(3) 



with 9 — 3/2 and ^0(9) = log(l + 2y/g). Physically, 
the partition function describes the configurational en- 
tropy of forming different secondary structures. Each 
such structure can be viewed as a configuration of an an- 
nealed (and rooted) branched polymer | p^ , p^ , as Fig. |l| 
suggests. Note that the exponent 6* = 3/2 is an impor- 
tant characteristic of the molten phase and will play a 
key role in what follows. 

We now include the additional energetic bias £ in (H) 
due to sequence design. For a RNA sequence of 2L bases, 
observe that each secondary structure consists of a series 
of native pairings, e.g., (ii, 2L — ii + 1), (12, 2i — 12 + 1), 
etc., separated by "bubbles" of lengths £k = ik+i — ik 
containing only non-native pairings of the intervening 
bases. Let the Boltzmann weight of each native pairing 
heq = eT'^l'^ ^ and let the restricted partition function de- 
scribing all possible non-native pairings in a molten bub- 
ble of length i be W(l\q). The total partition function 
Z(L -I- 1; g, g) for the model (||) can then be conveniently 
written in Laplace space as 



Z{y.-q,q)^W{^-q) Y.[viW{li\q) 



(4) 



which is easily summed. Here, q) and Z{ii;q,q) 

are respectively the Laplace transform of W{£; q) and 
Z{L; q, q). W{p, q) is completely specified by Eq. (|^) and 
the "boundary condition" 



Z{L + l;q^l,q) = Zo{2L + l;q). 



(5) 



Before we proceed with an analytical solution of this 
system, let us observe that Eq. (|^) is mathematically 



very similar to the equation derived by Poland and Scher- 
aga ]20| describing the thermal denaturation of perfectly 
complementary DNA double strands. The latter descrip- 
tion in turn is a refined version of the helix coil transi- 
tion proposed by Zimm pl[ [Note however that while 
the helix-coil transition and the DNA denaturation tran- 
sition consider only the interaction of native pairs, the 
RNA folding problem considered here includes interac- 
tions between bases far apart along the backbone of the 
chain.] The mathematical similarity can be made clearer 
if one assumes (as it will turn out to be the case) that 
the number of native pairings in each molten bubble is 
a negligible fraction of the total number of pairings, i.e., 
W{t,q) w Zo{2£ - l-q). Then, from the form of (|), 
one can think of W{t) as the Boltzmann weight of a 
"Gaussian polymer loop" of length i in d-dimensions, 
with the fictitious dimension d given by 29. Thus, the 
molten bubbles described by W are analogous to the de- 
naturation bubbles in the standard denaturation prob- 
lem, or to the coiled regions in the helix coil transition. 
They all represent the entropically favored phase but the 
origin of these entropies is quite different: The denatu- 
ration bubbles are formed by the configuration entropy 
gained by the unconstraint chains, while the molten bub- 
bles are driven by the "branching entropy" of secondary 
structures within the molten phase. In any case, we 
expect that the native-molten transition belongs to the 
same universality class as the denaturation transition of 
Ref. 0, with d = 20^3. 

The partition function Z{L;q,q) can actually be ob- 
tained exactly in the limit of large L, although the de- 
tails are somewhat tedious |14 . First, by using Eqs. (^) 
and (^), an exact expression for Z{iJ,;q,q) for arbitrary 
q can be derived. It is straightforward to extract the 
reduced free energy f{q,q) = — {log Z)/L from its singu- 
larities. It is —2fio{q) with /io('z) given in (^ for q < (jc 
and some other implicitly known function —fii{q,q) for 
q > qc- The critical point is at qc = (3-,yi + 2^/g — 
1)/(^1 + 2y^ — 1) > 1. Expanding around 5c yields 
f{q)^fiqc) ^ {q^qc)^7 which implies a continuous phase 
transition with a finite jump in the specific heat at the 
critical point; thus, the specific heat exponent is a = 0. 
From the free energy, we can easily compute the average 
fraction of native contacts, Q = — d//dlng, which con- 
stitutes the order parameter of the phase transition. In 
the thermodynamic limit, Q — Q tor q < qc and Q = 1 
for §'3> §c- Close to the critical point, Q ^ (q — qc). 

For strands of finite length L, this length always enters 
the saddle point equation involved in the inverse Laplace 
transform of Z{^) in the combination L{q — qcY with 
1/^2 jl^ . The finite-size result can be cast into the form 
Q{L) = L^^/'^ g [{q — qc)L^^'^] in the vicinity of the criti- 
cal point. The scaling function g[y] can be computed ex- 
actly numerically, with g[y] ^ y for y 3> 1 and g[y] 1/y 
for y < -1, with g[y] 0(1) for |?;| < 1. 
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In order to verify whether the above critical behav- 
ior, as derived from the simphfied system (|l|) and (||) 
describes the native-molten transition of designed het- 
erogeneous sequences, we numerically iterated Eq. for 
perfectly designed sequences [p2j . The energies Eij 's were 



chosen to be —1 for complementary pairs and otherwise. 
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FIG. 2. Numerical results on RNA sequences with bias for 
a hairpin, (a) Specific heat as a function of temperature for 
different system sizes. The vertical line indicates the critical 
temperature; the inset sketches the same quantity from the 
exact solution of the Go-like model, (b) Scaling plot of the 
fraction Q of native contacts with the exact solution for the 
scaling function g[y] of the Go-like model as the solid line. 

Fig. ^(a) shows the specific heat as a function of the 
temperature for perfectly designed sequences of 200 to 
1600 bases averaged over an ensemble of 100 realizations 
of randomness. Direct extraction of critical exponents 
from this data is difficult due to the strong correction- 
to-scaling effects of the expected discontinuity (a = 0). 
However, for a = 0, a good numerical estimate of the 
critical temperature Tc can be obtained from the com- 
mon intersection point of the curves at different lengths. 
This is more clearly seen from the result of the Go-like 
model (H); see inset of Fig. ^(a). The fraction Q of native 
contacts can then be used for a detailed scaling analysis. 
As shown in Fig. ||(b), the scaling plot of QL^^"^ versus 
L^/^(Tc — T)/Tc is consistent with the predicted criti- 
cal behavior around the phase transition and is well de- 
scribed by the scaling function g[y] of the Go-like model. 

To summarize, we analyzed the heteropolymer struc- 
ture formation problem in the context of RNA secondary 
structures. The native-molten structural transition re- 
sults from a competition between the energetic gain of na- 
tive contacts and the "branching entropy" of the molten 
phase. Critical properties can be obtained exactly af- 
ter introducing an approximate two-state model a la Go; 
the validity of the approximation is verified by direct nu- 
merical calculation of designed sequences. Throughout 
this study, we have neglected the effect of the excluded- 
volume interaction p3| . This effect changes the value 
of the exponent 6 [ p4| , hence changing the universality 
class or even the order of the phase transition in 3 di- 
mensions as will be discussed elsewhere p^ ; it however 
does not change the qualitative physics of the competing 
interactions discussed here. The latter should be accessi- 
ble experimentally using the recently developed molecu- 
lar beacon technique |p5t| . Finally, our study should also 



be relevant to the statistics of heteroduplex formation 
which occurs in the hybridization of partially complemen- 
tary heterogeneous DNA strands . In this regard, the 
structural transition discussed here resembles the physics 
of similarity detection discussed previously in the context 



of DNA sequence alignment |27|. 
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